Algorithms for Approximate Minimization of the Difference Between 
Submodular Functions, with Applications

Abstract

We extend the work of Narasimhan and
Bilmes [30] for minimizing set functions
representable as a difference between
submodular functions. Similar to [30], our new
algorithms are guaranteed to monotonically 
reduce the objective function at every step. 
We empirically and theoretically show that 
the per-iteration cost of our algorithms is 
much less than [30], and our algorithms can 
be used to efficiently minimize a difference 
between submodular functions under various 
combinatorial constraints, a problem not 
previously addressed. We provide compu- 
tational bounds and a hardness result on 
the multiplicative inapproximability of min- 
imizing the difference between submodular 
functions. We show, however, that it is 
possible to give worst-case additive bounds 
by providing a polynomial time computable 
lower-bound on the minima. Finally we show 
how a number of machine learning problems 
can be modeled as minimizing the difference 
between submodular functions. We experi- 
mentally show the validity of our algorithms 
by testing them on the problem of feature 
selection with submodular cost features. 



1 Introduction 



Discrete optimization is important to many areas of 
machine learning and recently an ever growing num- 
ber of problems have been shown to be expressible 
as submodular function minimization or maximization 
(e.g., [19, 23, 25, 28, 27, 29]). The class of submodular 
functions is indeed special since submodular function 
minimization is known to be polynomial time, while 
submodular maximization, although NP-hard, admits
constant factor approximation algorithms. Let
$V = \{1, 2, \dots, n\}$ refer to a ground set; then
$f : 2^V \to \mathbb{R}$ is said to be submodular if for sets
$S, T \subseteq V$, $f(S) + f(T) \geq f(S \cup T) + f(S \cap T)$
(see [11] for details on submodular, supermodular, and
modular functions). Submodular functions have a diminishing
returns property wherein the gain of an element in the
context of a bigger set is no greater than the gain of that
element in the context of a smaller subset. This property
occurs naturally in many applications in machine learning,
computer vision, economics, operations research, etc.

In this paper, we address the following problem. Given
two submodular functions $f$ and $g$, and defining
$v(X) = f(X) - g(X)$, solve the following optimization problem:

$\min_{X \subseteq V} [f(X) - g(X)] \equiv \min_{X \subseteq V} [v(X)]$. (1)



A number of machine learning problems involve 
minimization over a difference between submodular 
functions. The following are some examples: 

• Sensor placement with submodular costs: 

The problem of choosing sensor locations $A$ from
a given set of possible locations $V$ can be modeled
[23, 24] by maximizing the mutual information
between the chosen variables $A$ and the unchosen
set $V \setminus A$ (i.e., $f(A) = I(X_A; X_{V \setminus A})$). Alternatively,
we may wish to maximize the mutual information
between a set of chosen sensors $X_A$ and a fixed
quantity of interest $C$ (i.e., $f(A) = I(X_A; C)$)
under the assumption that the set of features $X_A$
are conditionally independent given $C$ [23]. These
objectives are submodular and thus the problem
becomes maximizing a submodular function subject
to a cardinality constraint. Often, however, there
are costs $c(A)$ associated with the locations that
naturally have a diminishing returns property.
For example, there is typically a discount when
purchasing sensors in bulk. Moreover, there may be
diminished cost for placing a sensor in a particular
location given placement in certain other locations
(e.g., the additional equipment needed to install a
sensor in, say, a precarious environment could be
re-used for multiple sensor installations in like
environments). Hence, along with maximizing mutual
information, we also want to simultaneously minimize
the cost, and this problem can be addressed by
maximizing $f(A) - \lambda c(A)$ for tradeoff parameter
$\lambda$, or equivalently by minimizing the difference
between submodular functions $\lambda c(A) - f(A)$.

• Discriminatively structured graphical mod- 
els and neural computation: An application 
suggested in [30] and the initial motivation for 
this problem is to optimize the EAR criterion to 
produce a discriminatively structured graphical 
model. EAR is basically a difference between two 
mutual information functions (i.e., a difference 
between submodular functions). [30] shows how 
classifiers based on discriminative structure using 
EAR can significantly outperform classifiers based 
on generative graphical models. Note also that the 
EAR measure is the same as "synergy" in a neural 
code [3], widely used in neuroscience. 

• Feature selection: Given a set of features
$X_1, X_2, \dots, X_{|V|}$, the feature selection problem
is to find a small subset of features $X_A$ that
work well when used in a pattern classifier. This
problem can be modeled as maximizing the mutual
information $I(X_A; C)$ where $C$ is the class. Note
that $I(X_A; C) = H(X_A) - H(X_A | C)$ is always a
difference between submodular functions. Under the
naive Bayes model, this function is submodular [23].
It is not submodular under general classifier models
such as support vector machines (SVMs) or neural
networks. Certain features, moreover, might be
cheaper to use given that others are already being
computed. For example, if a subset $S_i \subseteq V$ of the
features for a particular information source $i$ are
spectral in nature, then once a particular $v \in S_i$
is chosen, the remaining features $S_i \setminus \{v\}$ may be
relatively inexpensive to compute, due to grouped
computational strategies such as the fast Fourier
transform. Therefore, it might be more appropriate
to use a submodular cost model $c(A)$. One such cost
model might be $c(A) = \sum_i \sqrt{m(A \cap S_i)}$ where $m(j)$
would be the cost of computing feature $j$. Another
might be $c(A) = \sum_i c_i \min(|A \cap S_i|, 1)$ where $c_i$ is
the cost of source $i$. Both offer diminishing cost for
choosing features from the same information source.
Such a cost model could be useful even under the
naive Bayes model, where $I(X_A; C)$ is submodular.
Feature selection becomes a problem of maximizing
$I(X_A; C) - \lambda c(A) = H(X_A) - [H(X_A | C) + \lambda c(A)]$,
the difference between two submodular functions.

• Probabilistic Inference: A typical instance of
probabilistic inference is the following: We are given
a distribution $p(x) \propto \exp(-v(x))$ where $x \in \{0, 1\}^n$
and $v$ is a pseudo-Boolean function [2]. It is
desirable to compute $\operatorname{argmax}_{x \in \{0,1\}^n} p(x)$, which
means minimizing $v(x)$ over $x$, the most-probable
explanation (MPE) problem [33]. If $p$ factors with
respect to a graphical model of tree-width $k$, then
$v(x) = \sum_i v_i(x_{C_i})$ where $C_i$ is a bundle of indices
such that $|C_i| \leq k + 1$ and the sets $\{C_i\}_i$ form a
junction tree, and it might be possible to solve
inference using dynamic programming. If $k$ is
large and/or if hypertree factorization does not
hold, then approximate inference is typically used
[38]. On the other hand, defining $x(X) = \{x \in
\{0, 1\}^n : x_i = 1$ whenever $i \in X\}$, if the set function
$v(X) = v(x(X))$ is submodular, then even if $p$ has
large tree-width, the MPE problem can be solved
exactly in polynomial time [17]. This, in fact, is the
basis behind inference in many computer vision models
where $v$ is often not only submodular but also
has limited sized $|C_i|$. For example, for submodular
$v$ and if $|C_i| \leq 2$ then graph-cuts can solve the MPE
problem extremely rapidly [22] and even some cases
with $v$ non-submodular [21]. An important challenge
is to consider non-submodular $v$ that can be
minimized efficiently and for which there are
approximation guarantees, a problem recently addressed
in [18]. On the other hand, if $v$ can be expressed
as a difference between two submodular functions
(which it can, see Lemma 3.1), or if such a
decomposition can be computed (which it sometimes can,
see Lemma 3.2), then a procedure to minimize the
difference between two submodular functions offers
new ways to solve probabilistic inference.
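As an illustration of the two submodular cost models mentioned in the feature selection example, the following is a minimal sketch, not code from the paper. The grouping `groups`, the per-feature costs `m`, and the per-source costs `c` are hypothetical values chosen only for the example, and the brute-force submodularity check is practical only for tiny ground sets.

```python
import math
from itertools import combinations

def sqrt_group_cost(A, groups, m):
    """c(A) = sum_i sqrt(m(A ∩ S_i)), with m a dict of per-feature costs."""
    return sum(math.sqrt(sum(m[j] for j in A & S)) for S in groups)

def source_cost(A, groups, c):
    """c(A) = sum_i c_i * min(|A ∩ S_i|, 1): pay c_i once per source used."""
    return sum(ci * min(len(A & S), 1) for S, ci in zip(groups, c))

def is_submodular(f, V, tol=1e-12):
    """Exhaustively check f(S) + f(T) >= f(S ∪ T) + f(S ∩ T) for all S, T."""
    subsets = [frozenset(s) for r in range(len(V) + 1)
               for s in combinations(sorted(V), r)]
    return all(f(S) + f(T) + tol >= f(S | T) + f(S & T)
               for S in subsets for T in subsets)

V = frozenset(range(4))
groups = [frozenset({0, 1}), frozenset({2, 3})]  # hypothetical sources S_1, S_2
m = {0: 1.0, 1: 4.0, 2: 2.0, 3: 2.0}             # hypothetical per-feature costs
c = [3.0, 5.0]                                   # hypothetical per-source costs

sqrt_ok = is_submodular(lambda A: sqrt_group_cost(A, groups, m), V)
source_ok = is_submodular(lambda A: source_cost(A, groups, c), V)
```

Both checks should pass because each cost is a sum of concave functions of modular functions, which is the reason these models exhibit diminishing cost within a source.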

We note that given a solution to Equation 1, we can
also minimize the difference between two supermodular
functions $\min((-g) - (-f))$, maximize the difference
between two submodular functions $\max(-v) = \max(g - f)$,
and maximize the difference between two supermodular
functions $\max(-v) = \max((-f) - (-g))$.

Previously, Narasimhan and Bilmes [30] proposed
an algorithm inspired by the convex-concave procedure
[39] to address Equation (1). This algorithm
iteratively minimizes a submodular function by
replacing the second submodular function $g$ by its
modular lower bound. They also show that any set
function can be expressed as a difference between two
submodular functions and hence every set function
optimization problem can be reduced to minimizing a
difference between submodular functions. They show
that this process converges to a local minimum; however,
the convergence rate is left as an open question.

In this paper, we first describe tight modular bounds
on submodular functions in Section 2, including lower
bounds based on points in the base polytope as used
in [30], and recent upper bounds first described in a
result in [15]. In Section 3, we describe the
submodular-supermodular procedure proposed in [30]. We further
provide a constructive procedure for finding the
submodular functions $f$ and $g$ for any arbitrary set
function $v$. Although our construction is NP-hard in
general, we show how for certain classes of set functions $v$,
it is possible to find the decompositions $f$ and $g$ in
polynomial time. In Section 4, we propose two new
algorithms, both of which are guaranteed to monotonically
reduce the objective at every iteration and which
converge to a local minimum. Further we note that the
per-iteration cost of our algorithms is in general much less
than [30], and empirically verify that our algorithms
are orders of magnitude faster on real data. We show
that, unlike in [30], our algorithms can be extended to
easily optimize Equation (1) under cardinality,
knapsack, and matroid constraints. Moreover, one of our
algorithms can actually handle complex combinatorial
constraints, such as spanning trees, matchings, cuts,
etc. Further, in Section 5, we give a hardness result
that there does not exist any polynomial time
algorithm with any polynomial time multiplicative
approximation guarantees unless P=NP, even when it is easy
to find or when we are given the decomposition $f$ and
$g$, thus justifying the need for heuristic methods to solve
this problem. We show, however, that it is possible to
get additive bounds by showing polynomial time
computable upper and lower bounds on the optima. We also
provide computational bounds for all our algorithms
(including the submodular-supermodular procedure),
a problem left open in [30].

Finally we perform a number of experiments on the 
feature selection problem under various cost models, 
and show how our algorithms used to maximize the 
mutual information perform better than greedy selec- 
tion (which would be near optimal under the naive 
Bayes assumptions) and with less cost. 

2 Modular Upper and Lower bounds 

The Taylor series approximation of a convex function 
provides a natural way of providing lower bounds on 
such a function. In particular the first order Taylor 
series approximation of a convex function is a lower 
bound on the function, and is linear in x for a given 
$y$ and hence given a convex function $\phi$, we have:

$\phi(x) \geq \phi(y) + \langle \nabla\phi(y), x - y \rangle$. (2)

Surprisingly, any submodular function has both a tight 
lower [7] and upper bound [15], unlike strict convexity 
where there is only a tight first order lower bound. 



2.1 Modular Lower Bounds 

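Recall that for a submodular function $f$, the submodular
polymatroid, base polytope and the sub-differential
with respect to a set $Y$ [11] are respectively:

$\mathcal{P}_f = \{x : x(S) \leq f(S), \forall S \subseteq V\}$ (3)

$\mathcal{B}_f = \mathcal{P}_f \cap \{x : x(V) = f(V)\}$ (4)

$\partial f(Y) = \{y \in \mathbb{R}^V : \forall X \subseteq V, f(Y) - y(Y) \leq f(X) - y(X)\}$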

The extreme points of this sub-differential are easy
to find and characterize, and can be obtained from a
greedy algorithm ([7, 11]) as follows:

Theorem 2.1. ([11], Theorem 6.11) A point $y$ is
an extreme point of $\partial f(Y)$, iff there exists a chain
$\emptyset = S_0 \subset S_1 \subset \dots \subset S_n$ with $Y = S_j$ for some $j$, such
that $y(S_i \setminus S_{i-1}) = y(S_i) - y(S_{i-1}) = f(S_i) - f(S_{i-1})$.

Let $\sigma$ be a permutation of $V$ and define $S^\sigma_i =
\{\sigma(1), \sigma(2), \dots, \sigma(i)\}$ as $\sigma$'s chain containing $Y$,
meaning $S^\sigma_{|Y|} = Y$ (we say that $\sigma$'s chain contains $Y$). Then
we can define a sub-gradient $h^f_{Y,\sigma}$ corresponding to $f$ as:

$h^f_{Y,\sigma}(\sigma(i)) = f(S^\sigma_1)$ if $i = 1$, and
$h^f_{Y,\sigma}(\sigma(i)) = f(S^\sigma_i) - f(S^\sigma_{i-1})$ otherwise.

We get a modular lower bound of $f$ as follows:

$h^f_{Y,\sigma}(X) \leq f(X), \forall X \subseteq V$, and $\forall i$, $h^f_{Y,\sigma}(S^\sigma_i) = f(S^\sigma_i)$,

which is parameterized by a set $Y$ and a permutation
$\sigma$. Note $h(X) = \sum_{i \in X} h(i)$, and $h^f_{Y,\sigma}(Y) = f(Y)$.
Observe the similarity to convex functions, where a
linear lower bound is parameterized by a vector $y$.
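The chain construction above can be sketched in a few lines. This is an illustrative sketch only: the function name `modular_lower_bound`, the example function $f(X) = \sqrt{|X|}$, and the chosen permutation are assumptions made for the example, and the code assumes $f(\emptyset) = 0$.

```python
import math
from itertools import combinations

def modular_lower_bound(f, Y, order):
    """Weights h of a modular lower bound that is tight on the chain of `order`.

    `order` is a permutation of V (a list) whose first |Y| entries are exactly
    the elements of Y, so that its chain contains Y. Assumes f(∅) = 0.
    """
    Y = frozenset(Y)
    assert frozenset(order[:len(Y)]) == Y, "the chain of `order` must contain Y"
    h, prefix, prev = {}, frozenset(), 0.0
    for e in order:
        prefix = prefix | {e}
        cur = f(prefix)
        h[e] = cur - prev  # f(S_i) - f(S_{i-1}); equals f(S_1) when i = 1
        prev = cur
    return h

# Hypothetical example: f(X) = sqrt(|X|) on V = {0, 1, 2}, tight at Y = {1}.
f = lambda X: math.sqrt(len(X))
V = [0, 1, 2]
h = modular_lower_bound(f, {1}, [1, 0, 2])
subsets = [frozenset(s) for r in range(len(V) + 1) for s in combinations(V, r)]
is_lower_bound = all(sum(h[i] for i in X) <= f(X) + 1e-12 for X in subsets)
```

Computing the weights costs one pass over the permutation, that is, $n$ evaluations of $f$.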

2.2 Modular Upper Bounds 

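For $f$ submodular, [31] established the following:

$f(Y) \leq f(X) - \sum_{j \in X \setminus Y} f(j | X \setminus j) + \sum_{j \in Y \setminus X} f(j | X \cap Y)$,

$f(Y) \leq f(X) - \sum_{j \in X \setminus Y} f(j | (X \cup Y) \setminus j) + \sum_{j \in Y \setminus X} f(j | X)$.

Note that $f(A | B) = f(A \cup B) - f(B)$ is the gain of
adding $A$ in the context of $B$. These upper bounds
in fact characterize submodular functions, in that
a function $f$ is a submodular function iff it satisfies
either of the above bounds. Using the above, two tight
modular upper bounds ([15]) can be defined as follows:

$f(Y) \leq m^f_{X,1}(Y) \triangleq f(X) - \sum_{j \in X \setminus Y} f(j | X \setminus j) + \sum_{j \in Y \setminus X} f(j | \emptyset)$,

$f(Y) \leq m^f_{X,2}(Y) \triangleq f(X) - \sum_{j \in X \setminus Y} f(j | V \setminus j) + \sum_{j \in Y \setminus X} f(j | X)$.

Hence, this yields two tight (at set $X$) modular upper
bounds $m^f_{X,1}, m^f_{X,2}$ for any submodular function $f$.
For brevity, when referring to either one we use $m^f_X$.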
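The two modular upper bounds can be evaluated directly from their definitions. The sketch below is illustrative rather than the authors' implementation: the helper names `gain` and `modular_upper_bounds` and the example function $f(X) = \sqrt{|X|}$ are assumptions, and the exhaustive check is only feasible for small ground sets.

```python
import math
from itertools import combinations

def gain(f, j, B):
    """f(j | B) = f(B ∪ {j}) - f(B)."""
    B = frozenset(B)
    return f(B | {j}) - f(B)

def modular_upper_bounds(f, V, X):
    """Return evaluators (m1, m2) of the two modular bounds that are tight at X."""
    V, X = frozenset(V), frozenset(X)
    fX = f(X)
    drop1 = {j: gain(f, j, X - {j}) for j in X}         # f(j | X \ j)
    add1 = {j: gain(f, j, frozenset()) for j in V - X}  # f(j | ∅)
    drop2 = {j: gain(f, j, V - {j}) for j in X}         # f(j | V \ j)
    add2 = {j: gain(f, j, X) for j in V - X}            # f(j | X)

    def make(drop, add):
        def m(Y):
            Y = frozenset(Y)
            return fX - sum(drop[j] for j in X - Y) + sum(add[j] for j in Y - X)
        return m

    return make(drop1, add1), make(drop2, add2)

# Hypothetical example: f(X) = sqrt(|X|) on V = {0, 1, 2, 3}, tight at X = {0, 1}.
f = lambda S: math.sqrt(len(S))
V, X = frozenset(range(4)), frozenset({0, 1})
m1, m2 = modular_upper_bounds(f, V, X)
subsets = [frozenset(s) for r in range(len(V) + 1)
           for s in combinations(sorted(V), r)]
bounds_hold = all(m(Y) + 1e-12 >= f(Y) for m in (m1, m2) for Y in subsets)
tight_at_X = abs(m1(X) - f(X)) < 1e-12 and abs(m2(X) - f(X)) < 1e-12
```

Each bound needs $O(n)$ evaluations of $f$ to set up, after which evaluating it at any $Y$ is a sum over the symmetric difference of $X$ and $Y$.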



3 Submodular-Supermodular Procedure 

We now review the submodular-supermodular pro- 
cedure [30] to minimize functions expressible as a 
difference between submodular functions (henceforth 
called DS functions). Interestingly, any set function 
can be expressed as a DS function using suitable 
submodular functions as shown below. The result was 
first shown in [30] using the Lovász extension. We here
give a new combinatorial proof, which avoids Hessians
of polyhedral convex functions and which provides a
way of constructing a (non-unique) pair of submodular
functions $f$ and $g$ for an arbitrary set function $v$.

Lemma 3.1. [30] Given any set function $v$, it can
be expressed as a DS function $v(X) = f(X) -
g(X), \forall X \subseteq V$ for some submodular functions $f$ and $g$.

Proof. Given a set function $v$, we can define
$\alpha = \min_{X \subset Y \subseteq V \setminus j} v(j | X) - v(j | Y)$. Clearly $\alpha < 0$,
since otherwise $v$ would be submodular. Now
consider any (strictly) submodular function $g$, i.e.,
one having $\beta = \min_{X \subset Y \subseteq V \setminus j} g(j | X) - g(j | Y) > 0$.
Define $f'(X) = v(X) + \frac{|\alpha'|}{\beta} g(X)$ with any $\alpha' \leq \alpha$.
Now it is easy to see that $f'$ is submodular since
$\min_{X \subset Y \subseteq V \setminus j} f'(j | X) - f'(j | Y) \geq \alpha + |\alpha'| \geq 0$. Hence
$v(X) = f'(X) - \frac{|\alpha'|}{\beta} g(X)$ is a difference between two
submodular functions. □
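The construction in the proof can be checked numerically on a tiny ground set. The following sketch is an assumption-laden illustration: it computes $\alpha$ and $\beta$ by exhaustive search (exponential in $n$), picks $g(X) = \sqrt{|X|}$ as the strictly submodular function, uses $\alpha' = \alpha$, and the names `min_gain_gap` and `ds_decomposition` are hypothetical.

```python
import math
from itertools import combinations

def min_gain_gap(v, V):
    """min over j and X ⊂ Y ⊆ V \\ j of v(j|X) - v(j|Y), by exhaustive search."""
    V = frozenset(V)
    best = math.inf
    for j in V:
        rest = sorted(V - {j})
        subsets = [frozenset(s) for r in range(len(rest) + 1)
                   for s in combinations(rest, r)]
        for Y in subsets:
            for X in subsets:
                if X < Y:  # strict subset
                    gap = (v(X | {j}) - v(X)) - (v(Y | {j}) - v(Y))
                    best = min(best, gap)
    return best

def ds_decomposition(v, V):
    """Return (f, g), both submodular, with v = f - g on subsets of V."""
    alpha = min_gain_gap(v, V)
    g0 = lambda X: math.sqrt(len(X))     # strictly submodular choice
    beta = min_gain_gap(g0, V)
    scale = abs(min(alpha, 0.0)) / beta  # |alpha'| / beta with alpha' = min(alpha, 0)
    f = lambda X: v(X) + scale * g0(X)
    g = lambda X: scale * g0(X)
    return f, g

# Hypothetical example: v(X) = |X|^2 is supermodular, hence not submodular.
V = {0, 1, 2}
v = lambda X: float(len(X) ** 2)
f, g = ds_decomposition(v, V)
f_gap, g_gap = min_gain_gap(f, V), min_gain_gap(g, V)
```

A non-negative gap for both `f` and `g` confirms that each is submodular on this example, while `f(X) - g(X)` recovers `v(X)` up to floating-point error.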

The above proof requires the computation of $\alpha$ and $\beta$
which has, in general, exponential complexity. Using
the construction above, however, it is easy to find the
decomposition $f$ and $g$ under certain conditions on $v$.

Lemma 3.2. If $\alpha$ or at least a lower bound on $\alpha$ for
any set function $v$ can be computed in polynomial time,
functions $f$ and $g$ corresponding to $v$ can be obtained in
polynomial time.

Proof. Define $g$ as $g(X) = \sqrt{|X|}$. Then
$\beta = \min_{X \subset Y \subseteq V \setminus j} \sqrt{|X| + 1} - \sqrt{|X|} - \sqrt{|Y| + 1} + \sqrt{|Y|}$
$= \min_{X \subset V \setminus j} \sqrt{|X| + 1} - \sqrt{|X|} - \sqrt{|X| + 2} + \sqrt{|X| + 1}$
$= 2\sqrt{n - 1} - \sqrt{n} - \sqrt{n - 2}$.
The last equality follows
since the smallest difference in gains will occur at
$|X| = n - 2$. Hence $\beta$ is easily computed, and given a
lower bound on $\alpha$, from Lemma 3.1 the decomposition
can be obtained in polynomial time. A similar
argument holds for $g$ being other concave functions
over $|X|$. □
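The closed form for $\beta$ can be sanity-checked numerically. The sketch below is only a check under the stated choice $g(X) = \sqrt{|X|}$; because this $g$ depends on $X$ only through $|X|$, it minimizes over set sizes rather than over sets. The function names are hypothetical.

```python
import math

def beta_closed_form(n):
    """Closed form from the proof: 2*sqrt(n-1) - sqrt(n) - sqrt(n-2)."""
    return 2 * math.sqrt(n - 1) - math.sqrt(n) - math.sqrt(n - 2)

def beta_by_sizes(n):
    """Minimize g(j|X) - g(j|Y) over sizes |X| < |Y| <= n - 1."""
    gain = lambda k: math.sqrt(k + 1) - math.sqrt(k)
    return min(gain(a) - gain(b)
               for a in range(n - 1) for b in range(a + 1, n))

# Largest disagreement between the two computations for n = 3, ..., 10.
max_error = max(abs(beta_closed_form(n) - beta_by_sizes(n)) for n in range(3, 11))
```

The minimum is attained at sizes $n - 2$ and $n - 1$, matching the argument in the proof.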

The submodular-supermodular (SubSup) procedure is
given in Algorithm 1. At every step of the algorithm,
we minimize a submodular function which can be
performed in strongly polynomial time [32, 35] although
the best known complexity is $O(n^5 \eta + n^6)$ where $\eta$ is
the cost of a function evaluation. Algorithm 1 is
guaranteed to converge to a local minimum and moreover the
algorithm monotonically decreases the function
objective at every iteration, as we show below.

Footnote 1: We denote $j, X, Y : X \subset Y \subseteq V \setminus \{j\}$ by $X \subset Y \subseteq V \setminus j$.

Algorithm 1 The submodular-supermodular (SubSup) procedure [30]

1: $X^0 = \emptyset$; $t \leftarrow 0$;
2: while not converged (i.e., $(X^{t+1} \neq X^t)$) do
3: Randomly choose a permutation $\sigma^t$ whose chain
contains the set $X^t$.
4: $X^{t+1} := \operatorname{argmin}_X f(X) - h^g_{X^t,\sigma^t}(X)$
5: $t \leftarrow t + 1$
6: end while

Lemma 3.3. [30] Algorithm 1 is guaranteed to
decrease the objective function at every iteration.
Further, the algorithm is guaranteed to converge to a
local minimum by checking at most $O(n)$ permutations
at every iteration.

Proof. The objective reduces at every iteration since:

$f(X^{t+1}) - g(X^{t+1}) \leq f(X^{t+1}) - h^g_{X^t,\sigma^t}(X^{t+1})$ (a)
$\leq f(X^t) - h^g_{X^t,\sigma^t}(X^t)$ (b)
$= f(X^t) - g(X^t)$ (c)

where (a) follows since $h^g_{X^t,\sigma^t}(X^{t+1}) \leq g(X^{t+1})$,
(b) follows since $X^{t+1}$ is the minimizer of $f(X) -
h^g_{X^t,\sigma^t}(X)$, and (c) follows since $h^g_{X^t,\sigma^t}(X^t) = g(X^t)$
from the tightness of the modular lower bound.

Further note that, if there is no improvement in the
function value by considering $O(n)$ permutations each
with different elements at $\sigma^t(|X^t| - 1)$ and $\sigma^t(|X^t| + 1)$,
then this is equivalent to a local minimum condition
on $v$ since $h^g_{X^t,\sigma^t}(S^{\sigma^t}_{|X^t|+1}) = g(S^{\sigma^t}_{|X^t|+1})$ and
$h^g_{X^t,\sigma^t}(S^{\sigma^t}_{|X^t|-1}) = g(S^{\sigma^t}_{|X^t|-1})$. □

Algorithm 1 requires performing a submodular func- 
tion minimization at every iteration which while poly- 
nomial in n is (due to the complexity described above) 
not practical for large problem sizes. So while the algo- 
rithm reaches a local minima, it can be costly to find 
it. A desirable result, therefore, would be to develop 
new algorithms for minimizing DS functions, where the 
new algorithms have the same properties as the Sub- 
Sup procedure but are much faster in practice. We 
give this in the following sections. 



4 Alternate algorithms for minimizing 
DS functions 

In this section we propose two new algorithms to min- 
imize DS functions, both of which are guaranteed to 
monotonically reduce the objective at every iteration 
and converge to local minima. We briefly describe 
these algorithms in the subsections below. 

4.1 The supermodular-submodular (SupSub) 
procedure 

In the submodular-supermodular procedure we iteratively
minimized $f(X) - g(X)$ by replacing $g$ by its
modular lower bound at every iteration. We can
instead replace $f$ by its modular upper bound as is
done in Algorithm 2, which leads to the
supermodular-submodular procedure.



Algorithm 2 The supermodular-submodular (SupSub) procedure

1: $X^0 = \emptyset$; $t \leftarrow 0$;
2: while not converged (i.e., $(X^{t+1} \neq X^t)$) do
3: $X^{t+1} := \operatorname{argmin}_X m^f_{X^t}(X) - g(X)$
4: $t \leftarrow t + 1$
5: end while



In the SupSub procedure, at every step we perform
submodular maximization which, although NP-hard
to solve exactly, admits a number of fast constant
factor approximation algorithms [8, 9]. Notice that we
have two modular upper bounds and hence there are
a number of ways we can choose between them. One
way is to run both maximization procedures with the
two modular upper bounds at every iteration in
parallel, and choose the one which is better, meaning
the one that yields the lower function value.
Alternatively, we can alternate between the two
modular upper bounds, first maximizing the expression
using the first modular upper bound and then
maximizing it using the second. Notice that since we
perform approximate submodular maximization at every
iteration, we are not guaranteed to monotonically
reduce the objective value at every iteration. If,
however, we take the next step only if the objective
$v$ does not increase, we restore monotonicity. Also,
in some cases we converge to a local optimum, as shown
in the following theorem.

Theorem 4.1. Both variants of the supermodular-submodular
procedure (Algorithm 2) monotonically reduce
the objective value at every iteration. Moreover,
assuming a submodular maximization procedure in
line 3 that reaches a local maximum of $g(X) - m^f_{X^t}(X)$
(equivalently, a local minimum of $m^f_{X^t}(X) - g(X)$),
if Algorithm 2 does not improve under both modular
upper bounds then it reaches a local optimum of $v$.

Proof. For either modular upper bound, we have:

$f(X^{t+1}) - g(X^{t+1}) \leq m^f_{X^t}(X^{t+1}) - g(X^{t+1})$ (a)
$\leq m^f_{X^t}(X^t) - g(X^t)$ (b)
$= f(X^t) - g(X^t)$, (c)

where (a) follows since $f(X^{t+1}) \leq m^f_{X^t}(X^{t+1})$,
(b) follows since we assume that we take the next step
only if the objective value does not increase, and (c)
follows since $m^f_{X^t}(X^t) = f(X^t)$ from the tightness of
the modular upper bound.

To show that this algorithm converges to a local
minimum, we assume that the submodular maximization
procedure in line 3 converges to a local maximum. Then
observe that if the objective value does not decrease in
an iteration under both upper bounds, it implies that
$m^f_{X^t}(X^t) - g(X^t)$ is already a local optimum in that
(for both upper bounds) we have
$m^f_{X^t}(X^t \cup j) - g(X^t \cup j) \geq m^f_{X^t}(X^t) - g(X^t), \forall j \notin X^t$ and
$m^f_{X^t}(X^t \setminus j) - g(X^t \setminus j) \geq m^f_{X^t}(X^t) - g(X^t), \forall j \in X^t$. Note that
$m^f_{X^t,1}(X^t \setminus j) = f(X^t) - f(j | X^t \setminus j) = f(X^t \setminus j)$ and
$m^f_{X^t,2}(X^t \cup j) = f(X^t) + f(j | X^t) = f(X^t \cup j)$, and
hence if both modular upper bounds are at a local
optimum, it implies
$f(X^t) - g(X^t) = m^f_{X^t,1}(X^t) - g(X^t) \leq m^f_{X^t,1}(X^t \setminus j) - g(X^t \setminus j) = f(X^t \setminus j) - g(X^t \setminus j)$.
Similarly
$f(X^t) - g(X^t) = m^f_{X^t,2}(X^t) - g(X^t) \leq m^f_{X^t,2}(X^t \cup j) - g(X^t \cup j) = f(X^t \cup j) - g(X^t \cup j)$.
Hence $X^t$ is a local optimum for $v(X) = f(X) - g(X)$,
since $v(X^t) \leq v(X^t \cup j)$ and $v(X^t) \leq v(X^t \setminus j)$. □

To ensure that we take the largest step at each iter- 
ation, we can use the recently proposed tight (1/2)- 
approximation algorithm in [8] for unconstrained non- 
monotone submodular function maximization — this 
is the best possible in polynomial time for the class of 
submodular functions independent of the P=NP ques- 
tion. The algorithm is a form of bi-directional ran- 
domized greedy procedure and, most importantly for 
practical considerations, is linear time [8]. In prac- 
tice we just use a combination of a form of a simple 
greedy procedure, and the bi-directional randomized 
algorithm, by picking the best amongst the two at ev- 
ery iteration. Since the randomized greedy algorithm 
is 1/2 approximate, the combination of the two proce- 
dures also will be 1/2 approximate. Lastly, note that 
this algorithm is closely related to a local search heuris- 
tic for submodular maximization [9]. In particular, if 
instead of using the greedy algorithm entirely at every 
iteration, we take only one local step, we get a local 



search heuristic. Hence, via the SupSub procedure, we 
may take larger steps at every iteration as compared 
to a local search heuristic. 

4.2 The modular-modular (ModMod)
procedure 

The submodular-supermodular procedure and the
supermodular-submodular procedure were obtained by
replacing $g$ by its modular lower bound and $f$ by its
modular upper bound respectively. We can however
replace both of them by their respective modular bounds,
as is done in Algorithm 3.



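Algorithm 3 Modular-Modular (ModMod) procedure

1: $X^0 = \emptyset$; $t \leftarrow 0$;
2: while not converged (i.e., $(X^{t+1} \neq X^t)$) do
3: Choose a permutation $\sigma^t$ whose chain contains
the set $X^t$.
4: $X^{t+1} := \operatorname{argmin}_X m^f_{X^t}(X) - h^g_{X^t,\sigma^t}(X)$
5: $t \leftarrow t + 1$
6: end while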



In this algorithm at every iteration we minimize only
a modular function which can be done in $O(n)$ time,
so this is extremely easy (i.e., select all negative
elements for the smallest minimizer, or all non-positive
elements for the largest minimizer). Like before, since
we have two modular upper bounds, we can use any
of the variants discussed in the subsection above.
Moreover, we are still guaranteed to monotonically
decrease the objective at every iteration and converge
to a local minimum.
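To make the procedure concrete, here is a minimal sketch of the ModMod iteration, not the authors' implementation. It assumes the variant that evaluates both modular upper bounds at every iteration and keeps the better candidate, uses a single fixed permutation (elements of $X^t$ first, in sorted order) rather than any of the permutation heuristics, and stops when the objective no longer strictly decreases. Because only one permutation is tried per iteration, the result is not certified to be a local minimum. The function name `modmod` and the example functions are hypothetical.

```python
import math

def modmod(f, g, V, max_iter=100):
    """Sketch of the ModMod procedure for min_X f(X) - g(X), f and g submodular."""
    V = sorted(V)
    full = frozenset(V)
    X = frozenset()
    for _ in range(max_iter):
        # Lower-bound weights for g: gains along a chain that contains X.
        order = [j for j in V if j in X] + [j for j in V if j not in X]
        low, prefix = {}, frozenset()
        for e in order:
            low[e] = g(prefix | {e}) - g(prefix)
            prefix = prefix | {e}
        # Per-element weights of the two modular upper bounds of f at X.
        up1 = {j: f(X) - f(X - {j}) if j in X
               else f(frozenset({j})) - f(frozenset()) for j in V}
        up2 = {j: f(full) - f(full - {j}) if j in X
               else f(X | {j}) - f(X) for j in V}
        # Each surrogate is modular up to a constant, so its smallest
        # minimizer keeps exactly the elements with negative weight.
        candidates = [frozenset(j for j in V if up[j] - low[j] < 0)
                      for up in (up1, up2)]
        best = min(candidates, key=lambda Y: f(Y) - g(Y))
        if f(best) - g(best) >= f(X) - g(X):
            break  # no strict decrease of the true objective
        X = best
    return X

# Hypothetical example on V = {0, 1, 2}.
w = {0: 4.0, 1: 1.0, 2: 2.5}             # modular weights for g
f = lambda X: 3.0 * math.sqrt(len(X))    # submodular
g = lambda X: sum(w[j] for j in X)       # modular, hence submodular
X_star = modmod(f, g, {0, 1, 2})
```

On this example the iterates are $\emptyset$, $\{0\}$, $\{0, 2\}$, and $\{0, 1, 2\}$, with the second upper bound responsible for the last two steps. Each iteration costs $O(n)$ function evaluations plus a linear scan.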

Theorem 4.2. Algorithm 3 monotonically decreases
the function value at every iteration. If the function
value does not decrease on checking $O(n)$ different
permutations with different elements at adjacent positions
and with both modular upper bounds, then we have
reached a local minimum of $v$.

Proof. Again we can use similar reasoning as the
earlier proofs and observe that:

$f(X^{t+1}) - g(X^{t+1}) \leq m^f_{X^t}(X^{t+1}) - h^g_{X^t,\sigma^t}(X^{t+1})$
$\leq m^f_{X^t}(X^t) - h^g_{X^t,\sigma^t}(X^t)$
$= f(X^t) - g(X^t)$

We see that considering $O(n)$ permutations each with
different elements at $\sigma^t(|X^t| - 1)$ and $\sigma^t(|X^t| + 1)$,
we essentially consider all choices of $g(X^t \cup j)$ and
$g(X^t \setminus j)$, since $h^g_{X^t,\sigma^t}(S_{|X^t|+1}) = g(S_{|X^t|+1})$ and
$h^g_{X^t,\sigma^t}(S_{|X^t|-1}) = g(S_{|X^t|-1})$. Since we consider both
modular upper bounds, we correspondingly consider
every choice of $f(X^t \cup j)$ and $f(X^t \setminus j)$. Note that at
convergence we have that
$m^f_{X^t}(X^t) - h^g_{X^t,\sigma^t}(X^t) \leq m^f_{X^t}(X) - h^g_{X^t,\sigma^t}(X), \forall X \subseteq V$
for $O(n)$ different permutations and both modular upper
bounds. Correspondingly we are guaranteed that (since the
expression is modular) $\forall j \notin X^t, v(j | X^t) \geq 0$ and
$\forall j \in X^t, v(j | X^t \setminus j) \leq 0$, where $v(X) = f(X) - g(X)$. Hence
the algorithm converges to a local minimum. □

An important question is the choice of the
permutation $\sigma^t$ at every iterate $X^t$. We observe
experimentally that the quality of the algorithm
depends strongly on the choice of permutation. Observe
that $f(X) - g(X) \leq m^f_{X^t}(X) - h^g_{X^t,\sigma^t}(X)$, and
$f(X^t) - g(X^t) = m^f_{X^t}(X^t) - h^g_{X^t,\sigma^t}(X^t)$. Hence, we might
obtain the greatest local reduction in the value of $v$ by
choosing permutation $\sigma^* \in \operatorname{argmin}_\sigma \min_X (m^f_{X^t}(X) -
h^g_{X^t,\sigma}(X))$, or the one which maximizes $h^g_{X^t,\sigma}(X)$.
We in fact might expect that by choosing $\sigma^t$ ordered
according to greatest gains of $g$, with respect to $X^t$,
we would achieve greater descent at every iteration.
Another choice is to choose the permutation $\sigma$ based
on the ordering of gains of $v$ (or even $m^f_{X^t}$). Through
the former we are guaranteed to at least progress as
much as the local search heuristic. Indeed, we observe
in practice that the first two of these heuristics
perform much better than a random permutation
for both the ModMod and the SubSup procedure,
thus addressing a question raised in [30] about which
ordering to use. Practically, for the feature selection
problem, the second heuristic seems to work the best.

4.3 Constrained minimization of a difference 
between submodular functions 

In this section we consider the problem of minimizing
the difference between submodular functions subject
to constraints. We first note that the problem of
minimizing a submodular function under even simple
cardinality constraints is NP-hard and also hard to
approximate [36]. Since there does not yet seem to
be a reasonable algorithm for the constrained submodular
minimization needed at every iteration, it is unclear how
we would use Algorithm 1. However, the problem of
submodular maximization under cardinality, matroid,
and knapsack constraints, though NP-hard, admits
a number of constant factor approximation
algorithms [31, 26] and correspondingly the cardinality
constraints can be easily introduced in Algorithm 2.
Moreover, since a non-negative modular function can
be easily, directly and even exactly optimized under
cardinality, knapsack and matroid constraints [16],
Algorithm 3 can also easily be utilized. In addition,
since problems such as finding the minimum weight
spanning tree, min-cut in a graph, etc., admit polynomial
time algorithms in a number of cases, Algorithm 3
can be used when minimizing a non-negative function
$v$ expressible as a difference between submodular
functions under combinatorial constraints. If $v$ is
non-negative, then so is its modular upper bound,
and then the ModMod procedure can directly be used
for this problem: each iteration minimizes a
non-negative modular function subject to combinatorial
constraints, which is easy in many cases [16, 14].
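As a small illustration of why the constrained ModMod step is easy, the sketch below minimizes a modular function subject to the cardinality constraint $|Y| \leq k$. This covers only that one constraint type and is an assumption-based example: the function name `min_modular_cardinality` and the weights are hypothetical, and in a ModMod iteration the weights would be the per-element differences between the modular upper bound of $f$ and the modular lower bound of $g$.

```python
def min_modular_cardinality(weights, k):
    """argmin over |Y| <= k of sum(weights[j] for j in Y).

    Only negative weights can lower the sum, so keep the k most negative ones.
    """
    negative = sorted((w, j) for j, w in weights.items() if w < 0)
    return frozenset(j for _, j in negative[:k])

# Hypothetical per-element weights.
weights = {0: -3.0, 1: 2.0, 2: -1.0, 3: -5.0}
Y_best = min_modular_cardinality(weights, 2)
```

Other constraint families mentioned above, such as spanning trees or cuts, would replace this selection step with the corresponding combinatorial solver.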

5 Theoretical results 

In this section we analyze the computational and
approximation bounds for this problem. For simplicity
we assume that the function $v$ is normalized, i.e.,
$v(\emptyset) = 0$. Hence we assume that $v$ achieves its minimum
at a negative value and correspondingly the
approximation factor in this case will be less than 1.

We note in passing that the results in this section
are mostly negative, in that they demonstrate
theoretically how complex a general problem such as
$\min_X [f(X) - g(X)]$ is, even for submodular $f$ and $g$. In
this paper, rather than consider these hardness results
pessimistically, we think of them as providing
justification for the heuristic procedures given in Section 4
and [30]. In many cases, inspired heuristics can yield
good quality and hence practically useful algorithms 
for real-world problems. For example, the ModMod 
procedure (Algorithm 3) and even the SupSub proce- 
dure (Algorithm 2) can scale to very large problem 
sizes, and thus can provide useful new strategies for 
the applications listed in Section 1. 

5.1 Hardness 

Observe that the class of DS functions is essentially
the class of general set functions, and hence the
problem of finding optimal solutions is NP-hard. This is
not surprising since general set function minimization
is inapproximable and there exists a large class of
functions where all (adaptive, possibly randomized)
algorithms perform arbitrarily poorly in polynomial
time [37]. As is evident from Lemma 3.1,
even the problem of finding the submodular functions
$f$ and $g$ requires exponential complexity in general. We
show in the following theorem, moreover, that this
problem is multiplicatively inapproximable even when
the functions $f$ and $g$ are easy to find.
Theorem 5.1. Unless P = NP, there cannot exist
any polynomial time approximation algorithm for
$\min_X v(X)$ where $v(X) = [f(X) - g(X)]$ is a positive
set function and $f$ and $g$ are given submodular
functions. In particular, let $n$ be the size of the
problem instance, and $\alpha(n) > 0$ be any positive
polynomial time computable function of $n$. If there exists
a polynomial-time algorithm which is guaranteed to
find a set $X' : f(X') - g(X') < \alpha(n) \mathrm{OPT}$, where
$\mathrm{OPT} = \min_X f(X) - g(X)$, then P = NP.

Proof. We prove this by a reduction from the subset
sum problem: given a positive modular function $m$
and a positive constant $t$, is there a subset $S \subseteq V$ such
that $m(S) = t$? First we choose a random set $C$
(unknown to the algorithm), and define $t = m(C)$. Define
a set function $v$, such that $v(S) = 1$, if $m(S) = t$
and $v(S) = \frac{1}{\alpha(n)} - o(1)$ otherwise. Observe that
$\min_S v(S) = \frac{1}{\alpha(n)} - o(1)$, since $\alpha(n) > 1$. Note
that $\alpha = \min_{X \subset Y \subseteq V \setminus j} v(j | X) - v(j | Y) \geq 2(\frac{1}{\alpha(n)} - 1)$.
Hence we can easily compute a lower bound on $\alpha$ and
hence from Lemma 3.2 we can directly compute the
decomposition $f$ and $g$. In fact notice that the
decomposition is directly computable since both $\alpha$ and $\beta$ are
known.

Now suppose there exists a polynomial time algorithm
for this problem with an approximation factor of $\alpha(n)$.
This implies that the algorithm is guaranteed to find
a set $S$, such that $v(S) < 1$. Hence this algorithm
will solve the subset sum problem in polynomial time,
which is a contradiction unless P = NP. □

In fact we show below that, independent of the P =
NP question, there cannot exist a sub-exponential
time algorithm for this problem. The theorem below
gives information theoretic hardness for this problem.

Theorem 5.2. For any $0 < \epsilon < 1$, there cannot exist
any deterministic (or possibly randomized) algorithm
for $\min_X [f(X) - g(X)]$ (where $f$ and $g$ are given
submodular functions), that always finds a solution which
is at most $\frac{1}{\epsilon}$ times the optimal, in fewer than
$e^{\epsilon^2 n / 8}$ queries.

Proof. To show this theorem, we use the same
proof technique as in [9]. Define two sets $C$ and $D$
such that $V = C \cup D$ and $|C| = |D| = n/2$. We
then define a set function $v(S)$ which depends only on
$k = |S \cap C|$ and $l = |S \cap D|$. In particular, define
$v(S) = \frac{1}{\epsilon}$ if $|k - l| \leq \epsilon n$ and $v(S) = 1$ if $|k - l| > \epsilon n$.
Again, we have a trivial bound on $\alpha$ here: $v$ takes only the
values $1$ and $\frac{1}{\epsilon}$, so every marginal gain satisfies
$v(j|X) \geq 1 - \frac{1}{\epsilon}$ and $v(j|Y) \leq \frac{1}{\epsilon} - 1$. Hence
$\alpha = \min_{X \subset Y \subseteq V \setminus j} v(j|X) - v(j|Y) \geq -2|1 - \frac{1}{\epsilon}|$.
Thus, for this set function, a decomposition
$v = f - g$ can easily be obtained (Lemma 3.2).

Now, let the partition $(C, D)$ be taken uniformly
at random and unknown to the algorithm. The
algorithm issues some queries $S$ to the value oracle.
Call $S$ "unbalanced" if $|S \cap C|$ differs from $|S \cap D|$
by more than $\epsilon n$. Recall the Chernoff bounds [1]: Let
$Y_1, Y_2, \ldots, Y_t$ be independent random variables in
$[-1, 1]$, such that $E[Y_i] = 0$, then:

$$\Pr\Big[\sum_{i=1}^{t} Y_i > \lambda\Big] < 2e^{-\lambda^2/2t}. \tag{5}$$

Define $Y_i = I(i \in S)[I(i \in C) - I(i \in D)]$. Clearly
$Y_i \in [-1, 1]$, and we can use the bounds above with
$t = n$ and $\lambda = \epsilon n$. Hence for
any query $S$, the probability that $S$ is unbalanced is at
most $2e^{-\epsilon^2 n/2}$. Thus, we can see that even after $e^{\epsilon^2 n/4}$
queries, the probability that the resulting
set is unbalanced is still $2e^{-\epsilon^2 n/4}$. Hence, with high
probability, any algorithm will query only balanced sets
regardless of $C$ and $D$, and consequently with high
probability the algorithm will obtain $\frac{1}{\epsilon}$ as the minimum,
while the actual minimum is $1$. Thus, such an algorithm
will never be able to achieve an approximation factor
better than $\frac{1}{\epsilon}$. □
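
The concentration step in the proof above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it fixes one query of a given size, assumes each element falls into $C$ or $D$ independently with probability 1/2 (so that the $Y_i$ are independent, a simplification of the exact half-half partition), and compares the observed rate of unbalanced queries with the right-hand side of (5) for $t = n$ and $\lambda = \epsilon n$. The function names and parameter values are illustrative only.

```python
import math
import random

def unbalanced_rate(query_size, n, eps, trials, seed=0):
    """Fraction of random partitions under which a fixed query S is unbalanced.

    Assumption: each element joins C or D independently with probability 1/2,
    so the Y_i are independent and the Chernoff bound applies directly.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Sum of Y_i over i in S: +1 if i falls in C, -1 if it falls in D.
        diff = sum(1 if rng.random() < 0.5 else -1 for _ in range(query_size))
        if abs(diff) > eps * n:
            hits += 1
    return hits / trials

def chernoff_bound(n, eps):
    """Right-hand side of (5) with t = n and lambda = eps * n."""
    return 2.0 * math.exp(-(eps ** 2) * n / 2.0)

n, eps = 200, 0.3
rate = unbalanced_rate(query_size=100, n=n, eps=eps, trials=2000)
bound = chernoff_bound(n, eps)
print(f"empirical rate = {rate:.4f}, Chernoff bound = {bound:.2e}")
```

Under these assumptions the empirical rate should stay below the bound, which is the property the proof relies on when it argues that essentially every query is balanced.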

Essentially the theorems above say that even when
we are given (or can easily find) a decomposition
such that $v(X) = f(X) - g(X)$, there exist set
functions on which any algorithm (either adaptive
or randomized) will perform arbitrarily poorly, so
this problem is inapproximable. Hence any algorithm
that finds the global optimum for this problem [4]
can only be exponential.

5.2 Polynomial time lower and upper bounds 

The decomposition theorem of [6] shows that any
submodular function can be decomposed into a modular
function plus a monotone non-decreasing and totally
normalized polymatroid rank function. Specifically,
given submodular $f, g$ we have
$f'(X) \triangleq f(X) - \sum_{j \in X} f(j|V \setminus j)$ and
$g'(X) \triangleq g(X) - \sum_{j \in X} g(j|V \setminus j)$,
with $f', g'$ being totally normalized polymatroid rank
functions. Hence we have $v(X) = f'(X) - g'(X) + k(X)$,
with modular $k(X) = \sum_{j \in X} v(j|V \setminus j)$.
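
As an illustration of this decomposition, the sketch below builds $f'$, $g'$ and $k$ for two small submodular functions and checks by enumeration that $v(X) = f'(X) - g'(X) + k(X)$ on every subset. The ground set, the weights and the particular choices of $f$ and $g$ are hypothetical and chosen only to keep the example small.

```python
import math
from itertools import combinations

V = frozenset(range(4))
weights = {0: 1.0, 1: 2.0, 2: 0.5, 3: 3.0}   # hypothetical modular weights

def f(X):
    return math.sqrt(sum(weights[j] for j in X))   # concave of modular: submodular

def g(X):
    return 2.0 * math.log1p(len(X))                # concave of cardinality: submodular

def subsets(ground):
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def top_gain(h, j):
    """h(j | V minus j) = h(V) - h(V minus j)."""
    return h(V) - h(V - {j})

def totally_normalized(h):
    """Return h'(X) = h(X) - sum over j in X of h(j | V minus j)."""
    gains = {j: top_gain(h, j) for j in V}
    return lambda X: h(X) - sum(gains[j] for j in X)

f_prime = totally_normalized(f)
g_prime = totally_normalized(g)
k_single = {j: top_gain(f, j) - top_gain(g, j) for j in V}   # k(j) = v(j | V minus j)

def k(X):
    return sum(k_single[j] for j in X)

max_error = max(
    abs((f(X) - g(X)) - (f_prime(X) - g_prime(X) + k(X))) for X in subsets(V)
)
print("largest reconstruction error:", max_error)
```

The same enumeration can be used to confirm that $f'$ and $g'$ are monotone non-decreasing, which is the property the lower bounds in this section depend on.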

The algorithms in the previous sections are all based on 
repeatedly finding upper bounds for v. The following 
lower bounds directly follow from the results above. 

Theorem 5.3. We have the following two lower
bounds on the minimizers of $v(X) = f(X) - g(X)$:

$$\min_X v(X) \geq \min_X \big(f'(X) + k(X)\big) - g'(V)$$

$$\min_X v(X) \geq f'(\emptyset) - g'(V) + \sum_{j \in V} \min(k(j), 0)$$

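Proof. Notice that

$$\begin{aligned}
\min_X f(X) - g(X) &= \min_X f'(X) - g'(X) + k(X) \\
&\geq \min_X \big(f'(X) + k(X)\big) - \max_X g'(X) \\
&= \min_X \big(f'(X) + k(X)\big) - g'(V).
\end{aligned}$$

To get the second result, we start from the bound
above and loosen it as:

$$\begin{aligned}
\min_X \big(f'(X) + k(X)\big) - g'(V) &\geq \min_X f'(X) + \min_X k(X) - g'(V) \\
&= f'(\emptyset) + \sum_{j \in V} \min(v(j|V \setminus j), 0) - g'(V) \\
&= f'(\emptyset) + \sum_{j \in V} \min(k(j), 0) - g'(V).
\end{aligned} \tag{6}$$

□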
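
A small numerical check of Theorem 5.3 may help here. The sketch below is illustrative only: it evaluates both lower bounds for a hypothetical pair of submodular functions on five elements and compares them with the true minimum of $v$ found by brute force. In practice the first bound would come from a submodular minimization routine; exhaustive search stands in for it here.

```python
import math
from itertools import combinations

V = frozenset(range(5))
w = {0: 1.0, 1: 2.0, 2: 0.5, 3: 3.0, 4: 1.5}   # hypothetical modular weights

def f(X):
    return math.sqrt(sum(w[j] for j in X))      # submodular

def g(X):
    return 2.0 * math.log1p(len(X))             # submodular

def subsets(ground):
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def top_gain(h, j):
    return h(V) - h(V - {j})                    # h(j | V minus j)

def f_prime(X):
    return f(X) - sum(top_gain(f, j) for j in X)

def g_prime(X):
    return g(X) - sum(top_gain(g, j) for j in X)

k = {j: top_gain(f, j) - top_gain(g, j) for j in V}   # k(j) = v(j | V minus j)

def k_of(X):
    return sum(k[j] for j in X)

true_min = min(f(X) - g(X) for X in subsets(V))
# First bound: min_X (f'(X) + k(X)) - g'(V), by brute force in this toy setting.
bound_1 = min(f_prime(X) + k_of(X) for X in subsets(V)) - g_prime(V)
# Second bound: closed form, needs only n + 1 evaluations of f and g.
bound_2 = f_prime(frozenset()) - g_prime(V) + sum(min(k[j], 0.0) for j in V)
print(f"bound_2 = {bound_2:.4f} <= bound_1 = {bound_1:.4f} <= min v = {true_min:.4f}")
```

The gap between the value returned by a heuristic and either bound gives the additive guarantee discussed in the text.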

The above lower bounds essentially provide bounds on 
the minima of the objective and thus can be used to 
obtain an additive approximation guarantee. The al- 
gorithms described in this paper are all polynomial 
time algorithms (as we show below) and correspond- 
ingly from the bounds above we can get an estimate 
on how far we are from the optimal. 

5.3 Computational Bounds 

We now provide computational bounds for
$\epsilon$-approximate versions of our algorithms. Note that this
was left as an open question in [30]. Finding a local
minimizer of DS functions is PLS-complete since it
generalizes the problem of finding a local optimum
of the MAX-CUT problem [34]: if we set $f(X) = 0$
and let $g(X)$ be the cut function, we recover
the max cut problem. However, we show that an
$\epsilon$-approximate version of this algorithm will converge in
polynomial time.

Definition 5.1. An $\epsilon$-approximate version of an
iterative monotone non-increasing algorithm for
minimizing a set function $v$ is defined as a version of
that algorithm where we proceed to step $t + 1$ only
if $v(X^{t+1}) \leq v(X^t)(1 + \epsilon)$.
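
The acceptance rule of Definition 5.1 can be sketched in code. The example below is a hedged illustration: it uses a plain single-element add/remove local search as a stand-in for Algorithms 1, 2 and 3, and it additionally requires strict improvement so that the search cannot cycle when $v(X^t) = 0$ (that guard is an assumption of this sketch, not part of the definition). The toy objective is hypothetical.

```python
def eps_approx_local_search(v, V, eps, X0=frozenset()):
    """Move to the best neighbour only while v(X_next) <= (1 + eps) * v(X)."""
    X = frozenset(X0)
    iterations = 0
    while True:
        current = v(X)
        ground = sorted(V)
        neighbours = [X | {j} for j in ground if j not in X]
        neighbours += [X - {j} for j in ground if j in X]
        best = min(neighbours, key=v, default=None)
        # Stop when no neighbour improves by the required factor.
        if best is None or not (v(best) < current and v(best) <= current * (1 + eps)):
            return X, iterations
        X = best
        iterations += 1

def toy_v(X):
    # Hypothetical objective with v(empty) = 0 and minimum -4 at |X| = 2.
    return (len(X) - 2) ** 2 - 4

fine_set, fine_steps = eps_approx_local_search(toy_v, range(4), eps=0.1)
coarse_set, coarse_steps = eps_approx_local_search(toy_v, range(4), eps=0.5)
print(toy_v(fine_set), fine_steps, toy_v(coarse_set), coarse_steps)
```

With the smaller $\epsilon$ the search reaches the minimum, while the larger $\epsilon$ stops one step earlier because the remaining improvement is below the required factor. This is the trade-off between accuracy and iteration count that the definition introduces.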

Note that the $\epsilon$-approximate versions of algorithms 1,
2 and 3 are guaranteed to converge to $\epsilon$-approximate
local optima. An $\epsilon$-approximate local optimum of a
function $v$ is a set $X$ such that $v(X \cup j) \geq v(X)(1 + \epsilon)$ and
$v(X \setminus j) \geq v(X)(1 + \epsilon)$. W.l.o.g., assume that $X^0 = \emptyset$.
Then we have the following computational bounds:

Theorem 5.4. The $\epsilon$-approximate versions of
algorithms 1, 2 and 3 have a worst case complexity
of $O\big(\frac{\log(|M|/|m|)}{\epsilon}\, T\big)$, where
$M = f'(\emptyset) + \sum_{j \in V} \min(v(j|V \setminus j), 0) - g'(V)$,
$m = v(X^1)$ and $O(T)$
is the complexity of every iteration of the algorithm
(which corresponds to, respectively, the submodular
minimization, maximization, or modular minimization in
algorithms 1, 2 and 3).

Proof. Observe that $m = v(X^1) \leq v(X^0) = 0$.
Correspondingly, if $v(X^1) = 0$, it implies that the algorithm
has converged and cannot improve (since we are
assuming our algorithms are $\epsilon$-approximate). Hence in
this case the algorithm will converge in one iteration.
Consider then the case of $m < 0$. Note also from
Theorem 5.3 that $M = f'(\emptyset) + \sum_{j \in V} \min(v(j|V \setminus j), 0) - g'(V) < 0$
and that $\min_X f(X) - g(X) \geq M$. Since
we are guaranteed to improve by a factor of at least
$1 + \epsilon$ at every iteration, we have that in $k$ iterations
$|m|(1 + \epsilon)^k \leq |M| \Rightarrow k = O\big(\frac{\log(|M|/|m|)}{\epsilon}\big)$. Also, since
we assume that the complexity at every iteration is
$O(T)$, we get the above result. □
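
To make the iteration count concrete, the inequality $|m|(1 + \epsilon)^k \leq |M|$ from the proof can be solved for $k$ directly. The helper below is a small illustrative sketch (the function name and the sample values are hypothetical); it returns the largest $k$ satisfying the inequality, which for small $\epsilon$ is close to $\log(|M|/|m|)/\epsilon$ because $\log(1 + \epsilon) \approx \epsilon$.

```python
import math

def max_improving_iterations(M, m, eps):
    """Largest k with |m| * (1 + eps)**k <= |M|, for M <= m < 0 and eps > 0."""
    if m == 0:
        return 0   # no improvement over the empty set: the method stops at once
    return math.floor(math.log(abs(M) / abs(m)) / math.log(1.0 + eps))

print(max_improving_iterations(M=-100.0, m=-1.0, eps=0.1))
print(max_improving_iterations(M=-10.0, m=-1.0, eps=1.0))
```

Multiplying this count by the per-iteration cost $O(T)$ gives the overall bound of Theorem 5.4.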

Observe that for the algorithms we use, $O(T)$ is
strongly polynomial in $n$. The best strongly
polynomial time algorithm for submodular function
minimization is $O(n^5 \eta + n^6)$ [32], with $\eta$ the cost of a
function evaluation (the lower bound is currently
unknown). Further, the worst case complexity of
the greedy algorithm for maximization is $O(n^2)$, while
the complexity of modular minimization is just $O(n)$.
Note finally that these are worst case complexities and
the algorithms actually run much faster in practice.

Figure 1: Plot showing the accuracy rates vs. the number
of features on the Mushroom data set.

6 Experiments 

We test our algorithms on the feature subset selection
problem in the supervised setting. Given a set of
features $X_V = \{X_1, X_2, \ldots, X_{|V|}\}$, we try to find a subset
of these features $A$ which has the most information
from the original set $X_V$ about a class variable $C$ under
constraints on the size or cost of $A$. Normally the
number of features is quite large, and the training
and testing time depend on $|V|$. In many cases,
however, there is a strong correlation amongst features and
not every feature is novel. We can thus perform
training and testing with a much smaller number of features
$|A|$ while obtaining (almost) the same error rates.

The question is how to find the most representative
set of features $A$. The mutual information between the
chosen set of features and the target class $C$, $I(X_A; C)$,
captures the relevance of the chosen subset of features.
In most cases the selected features are not
independent given the class $C$, so the naive Bayes assumption
is not applicable, meaning this is not a pure
submodular optimization problem. As mentioned in Section 1,
$I(X_A; C)$ can be exactly expressed as a difference
between submodular functions $H(X_A)$ and $H(X_A|C)$.
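
The identity $I(X_A; C) = H(X_A) - H(X_A|C)$ can be evaluated directly from data. The following sketch is illustrative only: it uses plug-in (empirical frequency) entropy estimates without smoothing, computes $H(X_A|C)$ as $H(X_A, C) - H(C)$, and runs on a four-row toy data set that is not taken from the experiments.

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in entropy (in bits) of a list of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_information(rows, labels, A):
    """I(X_A; C) = H(X_A) - H(X_A | C), with H(X_A | C) = H(X_A, C) - H(C)."""
    cols = sorted(A)
    x_a = [tuple(row[j] for j in cols) for row in rows]
    h_xa = entropy(x_a)
    h_xa_given_c = entropy(list(zip(x_a, labels))) - entropy(labels)
    return h_xa - h_xa_given_c

# Toy data: feature 0 equals the class label, feature 1 is independent of it.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
print(mutual_information(rows, labels, {0}))     # informative feature
print(mutual_information(rows, labels, {1}))     # uninformative feature
print(mutual_information(rows, labels, {0, 1}))  # both features
```

Both entropy terms are submodular in $A$, so this function is one concrete instance of the difference $f - g$ that the algorithms in this paper take as input.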

6.1 Modular Cost Feature Selection 

In this subsection, we look at the problem of
maximizing $I(X_A; C) - \lambda|A|$ as a regularized feature subset
selection problem. Note that a mutual information
query $I(X_A; C)$ can easily be estimated from the data
in a single sweep through the data. Further, we
have observed that techniques such as Laplace
smoothing help to improve the mutual information
estimates without increasing computation. In these
experiments, therefore, we estimate the mutual information
directly from the data and run our algorithms to find
the representative subset of features.

We compare our algorithms on two data sets: the
Mushroom data set [13] and the Adult data set [20], both
obtained from [10]. The Mushroom data set has 8124
examples with 112 features, while the Adult data set has
32,561 examples with 123 features. In our experiments
we considered subsets of features of sizes between 5% and
20% of the total number of features by varying $\lambda$. We
tested the following algorithms for the feature subset
selection problem. We considered two formulations of
the mutual information: one under naive Bayes, where
the conditional entropy can be written as
$H(X_A|C) = \sum_{i \in A} H(X_i|C)$, and another where we do
not assume such a factorization. We call these two
formulations factored and non-factored respectively. We
then considered the simple greedy algorithm, which
iteratively adds features at every step, applied to the factored
and non-factored mutual information; we call these
GrF and GrNF respectively. Lastly, we use the new
algorithms presented in this paper on the non-factored
mutual information.
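
The greedy baseline described above can be sketched generically. The code below is an assumption-laden illustration rather than the experimental implementation: it takes any set-function objective, adds the element with the largest marginal gain, and stops when no addition has positive gain or a size limit is reached (the stopping rule is a choice made for this sketch). A toy coverage objective with a cardinality penalty stands in for $I(X_A; C) - \lambda|A|$.

```python
def greedy_select(objective, V, max_size):
    """Repeatedly add the element with the largest gain until no addition helps."""
    A = frozenset()
    while len(A) < max_size:
        gains = {j: objective(A | {j}) - objective(A) for j in sorted(V) if j not in A}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        A = A | {best}
    return A

# Hypothetical stand-in objective: coverage minus a modular penalty lam * |A|.
covers = {0: {"a", "b", "c"}, 1: {"c", "d"}, 2: {"d"}, 3: {"a"}}
lam = 0.5

def objective(A):
    covered = set().union(*(covers[j] for j in A))
    return len(covered) - lam * len(A)

selected = greedy_select(objective, covers.keys(), max_size=3)
print(sorted(selected), objective(selected))
```

Passing the factored or the non-factored mutual information as `objective` would give procedures in the spirit of GrF and GrNF respectively.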

We then compare the results of the greedy algorithms
with those of the three algorithms for this problem,
using two pattern classifiers based on either a linear
kernel SVM (using [5]) or a naive Bayes (NB) classifier.
We call the results obtained from the
supermodular-submodular heuristic "SupSub", the
submodular-supermodular procedure [30] "SubSup", and the
modular-modular procedure "ModMod." In the
SubSup procedure, we use the minimum norm point
algorithm [12] for submodular minimization, and in
the SupSub procedure, we use the optimal algorithm
of [8] for submodular maximization. We observed
that the three heuristics generally outperformed
the two greedy procedures, and also that GrF can
perform quite poorly, thus justifying our claim that
the naive Bayes assumption can be quite poor. This
also shows that although the greedy algorithm is
optimal under that assumption, the features are correlated
given the class and hence modeling the problem as a
difference between submodular functions gives the best
results. We also observed that the SupSub and ModMod
procedures perform comparably to the SubSup procedure,
while the SubSup procedure is much slower in practice.
Comparing the running times, the ModMod and
the SupSub procedures are each a few times slower
than the greedy algorithm (ModMod is slower due
to computing the modular semigradients), while the
SubSup procedure is around 100 times slower. The
SubSup procedure is slower due to general submodular
function minimization, which can be quite slow.

The results for the Mushroom data set are shown in
Figure 1. We performed a 10-fold cross-validation
on the entire data set and observed that, when using
all the features, SVM gave an accuracy rate of 99.6%
while the all-feature NB model had an accuracy rate
of 95.5%. The results for the Adult data set are in
Figure 2. In this case, with the entire set of features,
the accuracy rate of SVM is 83.9%
and that of NB is 82.3%.




(a) SVM (b) NB 

Figure 2: Plot showing the accuracy rates vs. the num- 
ber of features on the Adult data set. 

In the Mushroom data set, the SVM classifier significantly
outperforms the NB classifier and correspondingly GrF
performs much worse than the other algorithms. Also,
in most cases the three algorithms outperform GrNF.
In the Adult data set, the SVM and NB classifiers perform
comparably, although SVM outperforms NB. However,
in this case also we observe that our algorithms
generally outperform GrF and GrNF.

6.2 Submodular cost feature selection 

We perform synthetic experiments for the feature
subset selection problem under submodular costs. The
cost model we consider is $c(A) = \sum_i \sqrt{m(A \cap S_i)}$. We
partitioned $V$ into sets $\{S_i\}_i$ and chose the modular
function $m$ randomly. In this set of experiments, we
compare the accuracy of the classifiers vs. the cost
associated with the choice of features for the algorithms.
Recall that with simple (modular) cardinality costs the
greedy algorithms performed decently in comparison
to our algorithms on the Adult data set, where the NB
assumption is reasonable. However, with submodular
costs the objective is no longer submodular even
under the NB assumption, and thus the greedy algorithms
perform much worse. This is unsurprising since the
greedy algorithm is approximately optimal only for
monotone submodular functions. This is even more
strongly evident from the results on the Mushroom
data set (Figure 3).

Figure 3: Plot showing the accuracy rates vs. the cost
of features for the Mushroom data set.

Figure 4: Plot showing the accuracy rates vs. the cost
of features for the Adult data set.
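
The cost model $c(A) = \sum_i \sqrt{m(A \cap S_i)}$ is simple to write down in code. The sketch below is a minimal illustration with a hypothetical two-block partition and hypothetical weights (the experiments drew $m$ at random); it also prints one diminishing-returns comparison to show why this cost is submodular rather than modular.

```python
import math

def make_cost(partition, m):
    """Return c(A) = sum over blocks S_i of sqrt(m(A intersect S_i))."""
    blocks = [set(block) for block in partition]
    def cost(A):
        chosen = set(A)
        return sum(math.sqrt(sum(m[j] for j in block & chosen)) for block in blocks)
    return cost

partition = [{0, 1}, {2, 3}]          # hypothetical partition of V = {0, 1, 2, 3}
m = {0: 1.0, 1: 3.0, 2: 4.0, 3: 5.0}  # hypothetical positive modular weights
cost = make_cost(partition, m)

print(cost({0, 1}), cost({0, 2}), cost({0, 1, 2, 3}))
# Adding feature 0 costs less once feature 1 from the same block is present.
print(cost({0, 1}) - cost({1}), cost({0}) - cost(set()))
```

Features in the same block become cheaper to add together, which models shared acquisition cost and is what breaks the modularity assumed by the simple cardinality penalty.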

7 Discussion 

We have introduced new algorithms for optimizing
the difference between two submodular functions,
provided new theoretical understanding that gives
some justification for heuristics, outlined applications
that can make use of our procedures, and
tested them on feature selection with modular
and submodular cost features. Our new ModMod
procedure is fast at each iteration and experimentally does
about as well as the SupSub and SubSup procedures.
The ModMod procedure, moreover, can also be used
under various combinatorial constraints, and therefore
may hold the greatest promise
as a practical heuristic. An alternative approach, not
yet evaluated, would be to try the convex-concave
procedure [39] on the Lovász extensions of $f$ and $g$,
since subgradients in that case are easy to obtain.